
Improve research search with Tantivy-backed snippets#152

Draft
akseljoonas wants to merge 1 commit into main from codex/tantivy-research-search-20260427

Conversation

@akseljoonas (Collaborator)

What this PR does

This replaces the old Whoosh-backed search inside ml-intern's research tools with a small Tantivy-based search layer. The goal is not to add RAG or embeddings; it is to make the existing research tools return more precise, source-addressable results so the agent spends fewer tokens finding the right docs or examples.

Whoosh is unmaintained and emits Python 3.12 warnings in local runs. More importantly, the old search ranked whole docs/pages and GitHub paths, so research calls often sent the model broad results instead of the exact useful passage.

User-visible behavior

  • explore_hf_docs now ranks markdown passages instead of whole pages. Results include the heading and line range for the matched section.
  • find_hf_api now uses the same Tantivy search layer for OpenAPI endpoint search.
  • github_find_examples still starts from example-like files, but now also indexes source snippets from public repo contents when a keyword is provided.
  • GitHub example results include exact github_read_file line ranges and focused excerpts around the query terms.
  • Public GitHub/HF docs search no longer hard-fails just because local auth is missing or a GitHub token is rejected. Auth is still used when it works.
  • Network-backed research data is cached on disk under .ml-intern-cache/search by default, or ML_INTERN_SEARCH_CACHE_DIR when set.

Implementation notes

  • Adds agent/search/ with:
    • TantivyTextIndex: small wrapper around tantivy for field-boosted BM25 search.
    • markdown/code chunking helpers with source line ranges.
    • JSON cache helpers for fetched docs, OpenAPI specs, repo trees, file contents, and prepared snippet docs.
  • Removes the Whoosh dependency and the Whoosh warning filter.
  • Skips raw .ipynb content indexing for now because notebook JSON produced noisy snippets and misleading line ranges; notebooks can still appear as path-level example results.
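The markdown chunking described above (passages with headings and source line ranges) could look roughly like this. This is a minimal sketch under my own assumptions, not the PR's implementation; `chunk_markdown` and the dict shape are illustrative names:

```python
def chunk_markdown(text: str) -> list[dict]:
    """Split markdown into heading-delimited sections with 1-based line ranges.

    Each chunk records its heading and the start/end source lines so a
    search hit can point back at the exact passage.
    """
    chunks: list[dict] = []
    current: dict | None = None
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            if current is not None:
                chunks.append(current)
            current = {
                "heading": line.strip("# ").strip(),
                "start": lineno,
                "end": lineno,
                "body": [],
            }
        elif current is not None:
            current["body"].append(line)
            current["end"] = lineno
    if current is not None:
        chunks.append(current)
    return chunks
```

In the real layer these chunks would then be fed to the Tantivy index as individual documents, so ranking happens at passage granularity rather than per page.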

Validation

  • UV_CACHE_DIR=/tmp/uv-cache uv run pytest tests/unit/test_tantivy_search.py tests/unit/test_docs_tantivy_search.py tests/unit/test_github_find_examples_tantivy.py -q
    • 11 passed
  • UV_CACHE_DIR=/tmp/uv-cache uv run python -m compileall -q agent/search agent/tools/docs_tools.py agent/tools/github_find_examples.py
    • passed
  • Live tool checks:
  • explore_hf_docs on TRL with the query "dataset_text_field SFTConfig packing" returned the SFT / Packing section with source lines. A cached repeat took about 0.055s.
    • find_hf_api returned correct top endpoints for the queries "create repository", "upload file", and "space logs".
    • github_find_examples on huggingface/trl with the query "grpo trainer" returned focused source snippets; a cached repeat took about 0.031s.
  • Real CLI check:
    • ml-intern --max-iterations 6 --no-stream "Research current TRL GRPOTrainer usage..." naturally called explore_hf_docs, github_find_examples, fetch_hf_docs, and github_read_file, then returned a researched GRPOTrainer answer.

Known unrelated issue

The full unit suite currently reports two pre-existing failures in tests/unit/test_doom_loop.py: the tests still expect DOOM LOOP DETECTED while the runtime returns [SYSTEM: REPETITION GUARD]. This PR does not change that behavior.

Follow-up direction

This PR intentionally keeps scope to the search substrate. A natural next step is consolidating the research tools around a broader GitHub/HF interface, including model-accessible gh/hf CLI-style capabilities and more GitHub operations. The Tantivy layer here should give that future consolidation one shared precise search path instead of several independent ones.


This moves HF docs, HF OpenAPI, and GitHub example search onto a small Tantivy-backed search layer with passage/snippet chunking, source line ranges, and disk caches for network-backed research data. GitHub example lookup now searches file contents as well as paths, tolerates missing or rejected GitHub tokens for public repos, and returns focused snippets that the agent can follow up with github_read_file line ranges.
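The "focused snippet plus follow-up line range" flow above can be sketched in a few lines. This is an assumed illustration, not the shipped code; `focused_excerpt` and its return shape are hypothetical, and github_read_file is the real tool the range would feed:

```python
def focused_excerpt(source: str, terms: list[str], context: int = 2):
    """Return (excerpt, (start, end)) around the first line matching a term.

    The 1-based (start, end) range is what the agent could pass to a
    line-ranged follow-up read (e.g. github_read_file). Returns None when
    no term matches; matching is case-insensitive.
    """
    lines = source.splitlines()
    lowered = [ln.lower() for ln in lines]
    for i, ln in enumerate(lowered):
        if any(t.lower() in ln for t in terms):
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            return "\n".join(lines[start:end]), (start + 1, end)
    return None
```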

Constraint: Keep the PR scoped to search quality and do not introduce RAG or embedding infra.
Rejected: Keep Whoosh and suppress warnings | leaves the stale dependency and weaker result granularity in place.
Rejected: Index raw notebooks as snippets | raw ipynb JSON produced noisy excerpts and misleading line ranges.
Confidence: high
Scope-risk: moderate
Directive: Treat this as the search substrate for future research-tool consolidation; broader gh/hf CLI exposure should build on this rather than reintroducing independent search paths.
Tested: uv run pytest tests/unit/test_tantivy_search.py tests/unit/test_docs_tantivy_search.py tests/unit/test_github_find_examples_tantivy.py -q
Tested: uv run python -m compileall -q agent/search agent/tools/docs_tools.py agent/tools/github_find_examples.py
Tested: live explore_hf_docs, find_hf_api, github_find_examples calls with cached follow-up timings
Tested: real ml-intern CLI research prompt exercised explore_hf_docs, github_find_examples, fetch_hf_docs, and github_read_file
Not-tested: Full unit suite has two pre-existing doom-loop wording assertion failures unrelated to search.
@fglogan
fglogan commented May 3, 2026

closed per maintainer request

